WHEN DO STEPWISE ALGORITHMS MEET SUBSET
SELECTION CRITERIA?

By Xiaoming Huo and Xuelei (Sherry) Ni 

Georgia Institute of Technology 

Recent results in homotopy and solution paths demonstrate that 
certain well-designed greedy algorithms, with a range of values of the 
algorithmic parameter, can provide solution paths to a sequence of 
convex optimization problems. On the other hand, in regression many
existing criteria in subset selection (including Cp, AIC, BIC, MDL,
RIC, etc.) involve optimizing an objective function that contains a
counting measure. The two optimization problems are formulated as
(P1) and (P0) in the present paper. The latter is generally
combinatorial and has been proven to be NP-hard. We study the conditions
under which the two optimization problems have common solutions.
Hence, in these situations a stepwise algorithm can be used to solve
the seemingly unsolvable problem. Our main result is motivated by
recent work in sparse representation, while two others emerge from
different angles: a direct analysis of sufficiency and necessity and a
condition on the most correlated covariates. An extreme example
connected with least angle regression is of independent interest.

1. Introduction. We consider two types of optimization problem: 

• an optimization problem that is based on a counting measure,

(P0) $\min_x \|y - \Phi x\|_2^2 + \lambda_0 \cdot \|x\|_0$,

where $\Phi \in \mathbb{R}^{n \times m}$, $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$, the notation
$\|\cdot\|_2^2$ denotes the sum of squares of the entries of a vector, the
constant $\lambda_0 \ge 0$ is an algorithmic parameter and the quantity $\|x\|_0$
is the number of nonzero entries in the vector $x$;

• an optimization problem that depends on a sum of absolute values,

(P1) $\min_x \|y - \Phi x\|_2^2 + \lambda_1 \cdot \|x\|_1$,

where $\|x\|_1 = \sum_{i=1}^m |x_i|$ for the vector $x = (x_1, x_2, \dots, x_m)^T$ and
where the constant $\lambda_1 \ge 0$ is another algorithmic parameter.

Note that $\|x\|_0$ (resp., $\|x\|_1$) is a quasi-norm (resp., norm) in $\mathbb{R}^m$. In the
literature on sparse representation, these are called the $\ell_0$-norm and the
$\ell_1$-norm, respectively. The notation (P0) and (P1) also appears in [8], with
slightly different definitions.
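To make the two objectives concrete, the following is a minimal Python sketch that evaluates them for a given coefficient vector. The function names (`residual_sum_of_squares`, `objective_p0`, `objective_p1`) and the list-of-rows matrix layout are illustrative assumptions, not notation from the paper.

```python
def residual_sum_of_squares(phi, x, y):
    """Sum of squared residuals ||y - phi x||_2^2; phi is a list of rows."""
    rss = 0.0
    for row, y_i in zip(phi, y):
        fitted = sum(p * x_j for p, x_j in zip(row, x))
        rss += (y_i - fitted) ** 2
    return rss


def objective_p0(phi, x, y, lam0):
    """Counting-measure objective: RSS plus lam0 times the number of nonzeros."""
    l0 = sum(1 for x_j in x if x_j != 0)
    return residual_sum_of_squares(phi, x, y) + lam0 * l0


def objective_p1(phi, x, y, lam1):
    """Absolute-value objective: RSS plus lam1 times the sum of |x_j|."""
    l1 = sum(abs(x_j) for x_j in x)
    return residual_sum_of_squares(phi, x, y) + lam1 * l1
```

For the same residual, the first penalty depends only on how many coefficients are nonzero, whereas the second grows with their magnitudes.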

In subset selection under linear regression, many well-known criteria
(including the Cp statistic, the Akaike information criterion (AIC), the
Bayesian information criterion (BIC), minimum description length (MDL),
the risk inflation criterion (RIC) and so on) are special cases of (P0),
resulting from the assignment of different values to $\lambda_0$. It is shown
in this paper that problem (P0) is, in general, NP-hard (Theorem 2.1).
The NP-hardness has been known for many years, but to the best of our
knowledge, no paper has formally presented a proof of it.

At the same time, (P1), which has a long history that will be reviewed
later, is the mathematical problem that is called upon in [45]. Recent
advances (details and references are provided in Section 2.2) demonstrate
that some stepwise algorithms (e.g., [10, 38, 39]) reveal the solution paths
of problem (P1) while the parameter $\lambda_1$ takes a range of values. More
importantly, most of these algorithms take only a polynomial number of
operations (i.e., they are polynomial-time algorithms). In fact, the
complexity of finding a solution path for (P1) is the same as that of
implementing an ordinary least squares fit [10].

The major objective of this paper is to find out when (P0) and (P1) give
the same result in subset selection. A subset that corresponds to the nonzero
subset of the minimizer of (P0) [resp., (P1)] is called a type-0 (resp., type-1)
optimal subset with respect to $\lambda_0$ (resp., $\lambda_1$). A subset that is both type-0
and type-1 optimal is called a concurrent optimal subset. It is known that
there is a necessary and sufficient condition for the type-1 optimal subset
and that this condition can be verified in polynomial time. However, there is
no polynomial-time necessary-and-sufficient condition for the type-0 optimal
subset. We search for easily verified (i.e., polynomial-time) sufficient
conditions for type-0 optimal subsets. When sufficient conditions are available,
given solutions of (P1) by a stepwise algorithm, we can determine whether
(P0) has been solved. The title of this paper reflects such an objective.

The main contributions of this paper are two verifiable sufficient
conditions for (P0), Theorems 3.1 and 3.2. The latter is an improved version of
the former. Other conditions are generally standard and known. They are 
presented subsequently for the sake of completeness. 

The paper is organized as follows. Section 2 reviews the subset selection
criteria that can be formulated as (P0), as well as the literature on (P1).
Two cases are studied/reviewed. Section 3 contains the main results. Section
4 presents some associated conditions, which are either known or relatively
easy to verify. Section 5 discusses some related work. A brief conclusion is
provided in Section 6. Proofs are relegated to the appendices when
convenient.

2. Formulation and review of literature. We consider a regression
setting. Let $\Phi \in \mathbb{R}^{n \times m}$ ($n > m$) denote a model matrix. Vectors $x \in \mathbb{R}^m$ and
$y \in \mathbb{R}^n$ are coefficient and response vectors. The columns of matrix $\Phi$ are
covariates. A regression model is $y = \Phi x + \varepsilon$, where $\varepsilon$ is a random vector.
Let $I = \{1, 2, \dots, m\}$ denote the set of indices of the coefficients. A subset
of coefficients (or covariates) is denoted by $\Omega$ ($\Omega \subseteq I$). Let $|\Omega|$ denote the
cardinality of the set $\Omega$. Let $x_\Omega$ denote the coefficient vector that takes
nonzero values only when the coefficient indices are in the subset $\Omega$. To choose a
subset $\Omega$, a subset selection problem has two competing objectives: (1) the
residuals, $y - \Phi x_\Omega$, are close to zero and (2) the size of the set $\Omega$ is small.

2.1. Subset selection criteria and (P0). There exists an extensive body
of literature on the criteria regarding subset selection. Miller [31], Burnham
and Anderson [2] and George [19] all give excellent reviews. An interesting
fact is that a majority of these criteria can be unified under (P0), where
$\|y - \Phi x\|_2^2$ is the residual sum of squares [denoted by RSS(x)] and where
the constant $\lambda_0$ depends on the criterion. Some well-known examples are the
Akaike information criterion (AIC) [1], Cp [20, 30], the Bayesian information
criterion (BIC) [44], minimum description length (MDL) (see the equivalence
between BIC and MDL in [25], Section 7.8), the risk inflation criterion (RIC)
[15] and so on. We refer to George [19] for the details. In this paper, the
"subset selection criteria" that appear in the title encompass all of the
foregoing criteria.

Solving (P0) generally requires an exhaustive search of all subsets. When
$\|x\|_0$ (i.e., the number of covariates) increases, the methods based on
exhaustive search rapidly become impractical. Innovative ideas have been
developed to reduce the number of subsets being searched; see [17, 32], as
well as some later improvements, [18, 35, 36, 40, 41]. All of these methods
adopt a branch-and-bound (B&B) strategy. Improvements can be achieved
by modifying the structure in B&B or by applying stronger optimality tests.
Despite these efforts, when the number of covariates (m) is moderately large
(e.g., m = 50), the subset search cannot generally be carried out, unless the
model matrix $\Phi$ possesses some special structure.
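The growth of the search space can be illustrated with a short Python sketch that counts, and for small problems enumerates, the candidate subsets an exhaustive search would visit. The helper names `count_subsets` and `enumerate_subsets` are hypothetical and only serve this illustration; they do not implement any of the branch-and-bound methods cited above.

```python
from itertools import combinations
from math import comb


def count_subsets(m, max_size=None):
    """Number of subsets of m covariates with at most max_size elements.

    The empty subset is included; max_size=None means no size limit.
    """
    if max_size is None:
        max_size = m
    return sum(comb(m, k) for k in range(max_size + 1))


def enumerate_subsets(m, max_size):
    """Yield every subset of {0, ..., m-1} with at most max_size elements."""
    for k in range(max_size + 1):
        yield from combinations(range(m), k)
```

With m = 50 and no size limit the count is 2^50, which is roughly 1.1 x 10^15 least squares fits if every subset is evaluated.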

In fact, solving (P0) is an NP-hard problem. The following theorem can
be considered as an extension of a result originally presented in [33]. The
proof of the theorem appears in Appendix A.1.

Theorem 2.1. Solving (P0) with a fixed $\lambda_0$ is an NP-hard problem.

2.2. Stepwise algorithms and (P1). Due to the difficulty of solving (P0),
a relaxation idea has been proposed. The relaxation replaces the $\ell_0$-norm
with the $\ell_1$-norm in the objective, which leads to (P1).

Santosa and Symes [43] is considered the first modern appearance of the
formulation (P1). The idea of relaxation has been studied extensively in the
literature on sparse representation. Some representative papers are (roughly
in chronological order) [4, 6, 7, 8, 11, 16, 22, 23, 29, 46, 47], and so on. A
full review is well beyond the scope of this paper. The problem of sparse
representation has a different emphasis, involving the derivation of a priori
conditions instead of a posteriori conditions, as in the present paper.

At the same time, (P1) has been proposed in the statistics literature as
a method of subset selection. This has been termed the Lasso method [45].
An interesting recent development, least angle regression (LARS) [10],
demonstrates that certain stepwise algorithms can reveal the solutions to
(P1) with varying values of $\lambda_1$, based on the idea of homotopy (see [38]).
More recent analysis further demonstrates that stepwise algorithms can
literally render the entire solution path in a large class of problems; see [24]
and the references therein. The homotopy continuation method [39] and the
subdifferential are the key technical tools in this development. [42] and [37]
are useful references.

2.3. Case studies. We present two cases that have been instructive to us.

2.3.1. An extreme example. We construct an extreme example, in which 
a sophisticated stepwise algorithm (LARS) can miss an optimal subset. This 
example has played an inspirational role in our study of the equivalence 
conditions, which are discussed further in the next section. Details of the 
LARS algorithm can be found in [10], Section 2. We believe that this example 
is interesting in its own right. 

The example is constructed as follows. Let $\phi_i \in \mathbb{R}^n$, $i = 1, 2, \dots, m$, denote
the $i$th column of the model matrix $\Phi$. Hence, $\Phi = [\phi_1, \phi_2, \dots, \phi_m]$. Let
$\delta_i \in \mathbb{R}^n$, $i = 1, 2, \dots, m$, denote the Dirac vector taking 1 at the $i$th position
and zero elsewhere. For $i = m - A + 1, m - A + 2, \dots, m$, let $\phi_i = \delta_i$, where $A$
is a positive integer. Consider a signal $s = \frac{1}{\sqrt{A}} \sum_{i=m-A+1}^{m} \phi_i$. Obviously, in
this case the optimal subset is $\{m - A + 1, \dots, m\}$. For the first $m - A$ columns
of $\Phi$, let $\phi_j = a_j \cdot s + b_j \cdot \delta_j$, where $1 \le j \le m - A$ and $a_j^2 + b_j^2 = 1$. Note that
the $\phi_i$'s and $s$ are all unit-norm vectors. Hereafter, for simplicity, we always
assume that $1 \le j \le m - A$ and $m - A + 1 \le i \le m$. It is easy to verify that

$\langle s, \phi_j \rangle = a_j$ and $\langle s, \phi_i \rangle = 1/\sqrt{A}$.

In this example, we choose $1 > a_1 > a_2 > \cdots > a_{m-A} > 1/\sqrt{A} > 0$.

Theorem 2.2 (An extreme example). In the above example, the LARS
algorithm chooses covariates $\phi_1, \phi_2, \dots, \phi_{m-A}$ one at a time and in that
order in the first $m - A$ steps.

It takes some effort to verify the above theorem. We refer to the technical
report [26], which is a longer version of this paper, as well as to the thesis
[34].

Readers may notice that in the above example, the covariates are not 
standardized, while in the LARS algorithm, choosing covariates according 
to the inner product implies the standardization of covariates. A discussion 
in [26], Theorem 3.4, shows that this can be remedied by an orthogonal 
transformation. 

The foregoing example is developed in a fairly general form, with
controlling parameters $A$ and $m$. To illustrate how dramatic this example can be,
let us consider the case where $A = 10$ and $m = 1{,}000{,}000$. Based on the
previous description, the LARS algorithm will select the first 999,990 covariates
before it selects any of the last ten covariates. At the same time, the optimal
subset is formed by the last ten covariates. Another example regarding the
performance of LARS can be found in [48], which has a different emphasis.
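The construction can be checked numerically on a small instance. The Python sketch below builds the columns and the signal as described, taking n = m for simplicity (an assumption made here), and exposes the inner products that drive the first selection step. The function names are illustrative, and the sketch does not run LARS itself; it only shows that every one of the first m - A columns is more correlated with s than any column of the optimal subset.

```python
import math


def build_extreme_example(m, A, a):
    """Return (columns, s) for the extreme example, with n = m.

    a holds the m - A coefficients a_j, assumed to satisfy
    1 > a_1 > ... > a_{m-A} > 1/sqrt(A).
    """
    assert len(a) == m - A
    n = m
    # s is the normalized sum of the last A Dirac columns.
    s = [0.0] * n
    for i in range(m - A, m):
        s[i] = 1.0 / math.sqrt(A)
    columns = []
    # First m - A columns: phi_j = a_j * s + b_j * delta_j with a_j^2 + b_j^2 = 1.
    for j in range(m - A):
        b_j = math.sqrt(1.0 - a[j] ** 2)
        col = [a[j] * s_i for s_i in s]
        col[j] += b_j
        columns.append(col)
    # Last A columns: phi_i = delta_i.
    for i in range(m - A, m):
        col = [0.0] * n
        col[i] = 1.0
        columns.append(col)
    return columns, s


def inner(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(u, v))
```

For instance, with m = 6, A = 4 and a = [0.9, 0.8], the inner products with s are 0.9, 0.8 and then 0.5 for each of the last four columns.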

This example is motivated by an early example in [4], which can be traced
further back to [3] and [5] in the analysis of some stepwise algorithms (e.g.,
orthogonal matching pursuit) in signal processing. Our example is similar
in spirit; however, it differs in the details of its construction.

2.3.2. Subset selection with orthogonal model matrix. The following
result is well known: for an orthogonal model matrix $\Phi$, when $\sqrt{\lambda_0} = \lambda_1/2$,
the solutions to (P0) and (P1) have the same support. Moreover, at each
position in the support the solutions differ in magnitude by a constant $\lambda_1/2$.
A partial list of references for such a result includes [10, 38, 45], and many
more. For readers who are familiar with soft-thresholding and
hard-thresholding [9], this result should not come as a surprise.
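The orthogonal case can be sketched in a few lines of Python. The sketch assumes that the columns of the model matrix are orthonormal, so that both problems separate coordinate by coordinate in z = Phi^T y; the function names are illustrative. Under that assumption the minimizer of (P0) is a hard-thresholding of z at sqrt(lam0) and the minimizer of (P1) is a soft-thresholding of z at lam1/2.

```python
import math


def hard_threshold(z, lam0):
    """Coordinatewise minimizer of (z_i - x_i)^2 + lam0 * 1[x_i != 0]."""
    t = math.sqrt(lam0)
    return [z_i if abs(z_i) > t else 0.0 for z_i in z]


def soft_threshold(z, lam1):
    """Coordinatewise minimizer of (z_i - x_i)^2 + lam1 * |x_i|."""
    t = lam1 / 2.0
    return [math.copysign(abs(z_i) - t, z_i) if abs(z_i) > t else 0.0
            for z_i in z]


def support(x):
    """Indices of the nonzero entries of x."""
    return [i for i, x_i in enumerate(x) if x_i != 0]
```

Choosing lam0 = (lam1/2)^2 makes the two thresholds coincide, so the supports agree and the surviving coefficients differ by lam1/2 in magnitude.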

The above two examples collectively motivate us to pursue sufficient
conditions that guarantee common support in the solutions of (P0) and (P1).

3. Main results. A general sufficient condition for (P0) is derived. It is
motivated by a recent approach which has appeared in applied mathematics; 
see [22]. We have modified their approach to solve a different mathematical 
problem. 

Recall that $x \in \mathbb{R}^m$ denotes a coefficient vector. Denote the corresponding
residual vector by $\varepsilon = y - \Phi x$. Recall that $y \in \mathbb{R}^n$ and $\Phi \in \mathbb{R}^{n \times m}$ are the
response vector and the model matrix, respectively. Let $\Omega$ denote the support
of the vector $x$: $\Omega = \mathrm{supp}(x)$. For an integer $k \ge 1$, let

$$\sigma^2_{\min,k} = \inf_{\delta} \frac{\|\Phi\delta\|_2^2}{\|\delta\|_2^2} \quad \text{subject to } \|\delta\|_0 \le k.$$

The above quantity reflects a certain property of the model matrix.
Furthermore, for a vector $v \in \mathbb{R}^n$ and an integer $k \ge 1$, we define

$$c(v, k) = \sqrt{\sum_{i=1}^{k} |v_{(i)}|^2},$$

where $|v_{(1)}| \ge |v_{(2)}| \ge \cdots \ge |v_{(n)}|$ are the nonincreasing-ordered magnitudes
of the entries of the vector $v$. For finite $k$, we assume that the quantities
$c(\Phi^T\varepsilon, k)$ and $\sigma^2_{\min,k}$ have been computed. The following theorem provides
a sufficient condition for a subset to be included in a type-0 optimal subset
with respect to $\lambda_0$.
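The quantity c(v, k) is inexpensive to compute. The following Python sketch assumes the definition reads as the square root of the sum of the k largest squared magnitudes of v, which is how the later bounds use it; the function name `c` mirrors the notation, and the implementation itself is only an illustration.

```python
import math


def c(v, k):
    """Euclidean norm of the k largest-magnitude entries of v."""
    magnitudes = sorted((abs(v_i) for v_i in v), reverse=True)
    return math.sqrt(sum(mag ** 2 for mag in magnitudes[:k]))
```

Under this reading, c(v, 1) is the largest magnitude among the entries of v, and c(v, k) equals the Euclidean norm of v once k reaches the length of v.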

Theorem 3.1 (Main result 1). A subset of coefficients $\Omega$ is given.
Suppose that coefficient vector $x$ is the minimizer of the function $\|y - \Phi x\|_2^2$
subject to $\mathrm{supp}(x) \subseteq \Omega$. Let $\varepsilon = y - \Phi x$.

(1) If $\min_{i \in \Omega} |x_i| > q_1(|\Omega|)$, then, with respect to $\lambda_0$, there is no type-0
optimal subset whose support is of size less than $|\Omega|$.

(2) Furthermore, if $\min_{i \in \Omega} |x_i| > q(|\Omega|)$, then, with respect to $\lambda_0$, we have
$\Omega \subseteq \Omega'$, where $\Omega'$ is the type-0 optimal subset with respect to $\lambda_0$.

The quantities $q_1(\cdot)$ and $q(\cdot)$ are defined as follows. For an integer $k \ge 1$,

$$q_1(k) = \sup_{m < k} \frac{c(\Phi^T\varepsilon, 1) + \sqrt{c^2(\Phi^T\varepsilon, k + m) + (k - m)\lambda_0 \sigma^2_{\min,k+m}}}{\sigma^2_{\min,k+m}},$$

$$q_2(k) = \sup_{m \ge k} \frac{c(\Phi^T\varepsilon, 1) + \sqrt{c^2(\Phi^T\varepsilon, k + m) + (k - m)\lambda_0 \sigma^2_{\min,k+m}}}{\sigma^2_{\min,k+m}}$$

and

$$q(k) = \max\{q_1(k), q_2(k)\}.$$

The proof of this theorem appears in Appendix A.2.

Note that quantities $q_1(\cdot)$ and $q_2(\cdot)$ have the same objective function.
However, the ranges of the variable $m$ are different. Because $q_1(k)$ requires
only a finite choice of the variable $m$ (recall that $m < k$), it is computable.
It is not straightforward to show that for any $k \ge 1$, the quantity $q_2(k)$ will
exist. In this paper, we assume the existence of this quantity.

Readers may compare the above with the test proposed in [22]. That test
is related to the optimality in sparse representations.

In Theorem 3.1, the quantities $q_1(\cdot)$ and $q(\cdot)$ require multiple values of
$\sigma^2_{\min,k}$ for a range of values of $k$. Compared to the quantities $c(\cdot, k)$, it is
harder to compute the $\sigma^2_{\min,k}$'s. Inspired by the derivation in Theorem 2 of
[22], we derive a sufficient condition, which depends only on $\sigma^2_{\min,|\Omega|}$, where
$\Omega$ is the subset being tested. To state our result, the following quantity needs
to be defined: for an integer $m \ge 1$ and a given integral constant $M$, let

$$\lambda(m; M) = 1 - \frac{M}{\sqrt{m}} \sup_{|\mathcal{I}| \le m} \ \sup_{k \notin \mathcal{I}} \|\Phi_{\mathcal{I}}^{+} \phi_k\|_2,$$

where $\mathcal{I}$ is a subset of indices, $|\mathcal{I}|$ denotes the size of this subset, the
matrix $\Phi_{\mathcal{I}}$ is a submatrix of $\Phi$ whose column indices form the set $\mathcal{I}$,
$\Phi_{\mathcal{I}}^{+} = (\Phi_{\mathcal{I}}^{*} \Phi_{\mathcal{I}})^{-1} \Phi_{\mathcal{I}}^{*}$ is the Moore-Penrose pseudo-inverse [21], with $(\cdot)^{*}$
denoting the adjoint, and $\phi_k$ is the $k$th column (i.e., covariate) in $\Phi$. Given $m$,
the quantity $\lambda(m; M)$ can be computed by enumerating all $m$-subsets of the
covariates.

Now we present another sufficient condition. 

Theorem 3.2 (Main result 2). A subset of coefficients $\Omega$ is given.
Suppose that coefficient vector $x$ is the minimizer of the function $\|y - \Phi x\|_2^2$
subject to $\mathrm{supp}(x) \subseteq \Omega$. Suppose it is known a priori that the size of the
type-0 optimal subset is no larger than $M$. If $\min_{i \in \Omega} |x_i| > q'(|\Omega|, M)$, then the
set $\Omega$ is at least a subset of the type-0 optimal subset. Here, the quantity $q'(\cdot)$
is defined, for integer $k \ge 1$ and constant $M$, as

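$$q'(k, M) = \sup_{1 \le m \le M} \frac{k + m}{k \cdot \sigma^2_{\min,k} \cdot \lambda^2(k; m)} \left( c(\Phi^T\varepsilon, 1) + \sqrt{c^2(\Phi^T\varepsilon, k) + \lambda_0 \cdot \frac{k^2 (k - m)}{(k + m)^2} \cdot \sigma^2_{\min,k} \cdot \lambda^2(k; m)} \right).$$

See the proof in Appendix A.3.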

If the model matrix $\Phi$ is orthonormal, readers can verify that $\sigma^2_{\min,k} = 1$
and $\lambda(m; M) = 1$. This brings about significantly simplified criteria in
Theorem 3.1 and Theorem 3.2. Compared with the case when the model
matrix is orthogonal, the new criteria are less attractive. We consider this a
price to be paid for the generality.

The two results here focus on the type-0 optimal subset. Given a type-1
optimal subset (which can be derived from some efficient algorithm), one
can easily calculate the least squares estimator according to it and use this
estimator and subset to test whether the subset is also type-0 optimal.

4. Other conditions of equivalence. In Section 4.1, we give a sufficient
and necessary condition for a subset to be the concurrent optimal subset.
Checking this condition cannot be achieved in polynomial time [recall that
(P0) is NP-hard]. In Section 4.2, we ask when the k most correlated
covariates form the concurrent optimal subset. A sufficient condition is derived.
This result is easy to check, but too restrictive.

4.1. Sufficient and necessary conditions. Before moving on to the specific
discussion, we introduce a sufficient and necessary condition for a concurrent
optimal subset. Let $I_1$ denote a subset of indices. Let $\Phi_1$ and $x_1$ denote
columns of $\Phi$ and entries of $x$ with indices from $I_1$. Let $\Phi = [\Phi_1 \ \Phi_2]$. Here,
a permutation that does not change the problem is implied. The following
can easily be verified.

Lemma 4.1 [Sufficient and necessary condition for (P0)]. $I_1$ is the
optimal subset of (P0) if and only if the value

(1) $y^T y - y^T \Phi_1 (\Phi_1^T \Phi_1)^{-1} \Phi_1^T y + \lambda_0 \cdot \|x_1\|_0$

is the minimum of the objective in (P0).

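The following is well known (see, e.g., [29, 38, 47]).

Lemma 4.2 [Sufficient and necessary condition for (P1)]. $I_1$ is the
optimal subset of (P1) if and only if there exists a vector $\omega$ such that

(2) $\Phi^T y = \begin{pmatrix} \Phi_1^T \Phi_1 \\ \Phi_2^T \Phi_1 \end{pmatrix} x_1 + \begin{pmatrix} \frac{\lambda_1}{2} \, \mathrm{sign}(x_1) \\ \omega \end{pmatrix}$

holds and $\|\omega\|_\infty \le \lambda_1/2$.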

The following can be easily derived from the above two lemmas.

Corollary 4.3 [Sufficient and necessary condition (for concurrence)].
$I_1$ is the concurrent optimal subset of (P0) and (P1) if and only if (1)
and (2) are true. Moreover, with $\tilde{x}_0$ and $\tilde{x}_1$ the solutions of (P0) and (P1),
respectively, we have

(3) $(\tilde{x}_0 - \tilde{x}_1)_{I_1} = (\Phi_1^T \Phi_1)^{-1} \cdot \frac{\lambda_1}{2} \cdot \mathrm{sign}((\tilde{x}_1)_{I_1})$.

For the equation above, consider

$(\tilde{x}_0)_{I_1} = (\Phi_1^T \Phi_1)^{-1} \Phi_1^T y$

and

$\Phi_1^T y = \Phi_1^T \Phi_1 (\tilde{x}_1)_{I_1} + \frac{\lambda_1}{2} \cdot \mathrm{sign}((\tilde{x}_1)_{I_1})$.

By combining the above two, equation (3) follows.

The above corollary gives a necessary and sufficient condition for a
concurrent optimal subset. Further comments follow.

Remark 4.4. Equation (3) provides a method for computing $\tilde{x}_1$, given
that $\tilde{x}_0$ is available and represents the optimal solution. Evidently,

$(\tilde{x}_1)_{I_1} = (\tilde{x}_0)_{I_1} - \frac{\lambda_1}{2} (\Phi_1^T \Phi_1)^{-1} \cdot \mathrm{sign}((\tilde{x}_1)_{I_1})$.

Remark 4.5. Note that

$\Phi(\tilde{x}_0 - \tilde{x}_1) = \Phi_1 (\tilde{x}_0 - \tilde{x}_1)_{I_1} = \frac{\lambda_1}{2} \cdot \Phi_1 (\Phi_1^T \Phi_1)^{-1} \cdot \mathrm{sign}((\tilde{x}_1)_{I_1})$,

which is an equiangular vector among the columns of $\Phi_1$. Hence, when
optimality is achieved in both (2) and (3), the difference between the two
predicted vectors is an equiangular vector.

4.2. A sufficient condition for mostly correlated covariates. We
introduce a set of sufficient conditions which depend only on the correlations
between the response $y$ and the covariates $\phi_i$, as well as the maximum
correlation between the covariates. For simplicity, we assume that the
response $y$ and the covariates $\phi_i$ are all standardized. It is not hard to see
that $|\langle y, \phi_i \rangle| \le 1$, $i = 1, 2, \dots, m$, and $|\langle \phi_i, \phi_j \rangle| \le 1$, $1 \le i, j \le m$. Denote
$z = \Phi^T y = (z_1, z_2, \dots, z_m)^T$. Without loss of generality, we assume
$|z_1| \ge |z_2| \ge \cdots \ge |z_m|$. We want to find sufficient conditions such that the
subset $A_1 = \{\phi_1, \phi_2, \dots, \phi_k\}$ is the solution to both (P0) and (P1): the k most
correlated covariates (with the response) form the optimal subset. Clearly, an
optimal subset does not need to consist of the covariates most correlated with
the response. Due to this additional condition, this set of conditions is
restrictive. The restrictiveness is illustrated in an example in Section 4.2.1.

Denote

$\mu = \max_{1 \le i, j \le m, \ i \ne j} |\langle \phi_i, \phi_j \rangle|.$
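The maximum correlation between covariates can be computed directly from its definition. The Python sketch below assumes that "standardized" means each column is centered and scaled to unit Euclidean norm, and that the maximum is taken over pairs of distinct columns; the function names `standardize` and `max_pairwise_correlation` are illustrative.

```python
import math


def standardize(column):
    """Center a column and scale it to unit Euclidean norm.

    A constant column has zero norm after centering and is not supported.
    """
    n = len(column)
    mean = sum(column) / n
    centered = [v - mean for v in column]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]


def max_pairwise_correlation(columns):
    """Largest absolute inner product between two distinct standardized columns."""
    cols = [standardize(col) for col in columns]
    mu = 0.0
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            corr = sum(p * q for p, q in zip(cols[i], cols[j]))
            mu = max(mu, abs(corr))
    return mu
```

The double loop costs on the order of m^2 inner products of length n, so this quantity is cheap compared with any search over subsets.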

We have the following theorem. 

Theorem 4.6. For a given $\lambda_0$ and correlations $z_1, z_2, \dots, z_k$, if the three
conditions

(4) [1 - {k - l)^^]zl > 2{k - 1)V + + {k- 



2 .^ {2k-\)ix 



k 



{2k-3)fi 



k 



^^^^°+ l + (fe-l)/. g"- 
are satisfied, where A = n- n in (5), then the subset $A_1$ is the type-0 optimal
subset.

To prove the above theorem, we can show that for subsets of size equal
to k, greater than k and less than k, the above three conditions guarantee that
subset $A_1$ is the type-0 optimal subset. A detailed proof can be found in [26]
or [34]. Anyone whose interest is restricted to (P0) should now be satisfied.
The following is to establish a condition for concurrent optimality.

Remark 4.7. Conditions (4), (5) and (6) are independent, that is, none
of them can be derived from the other two.

The following theorem states the condition for the set $A_1 = \{\phi_1, \phi_2, \dots, \phi_k\}$
to be the type-1 optimal subset; see the proof in [26] or [34].

Theorem 4.8. Given A and k, if 



(7) --\zk+i\> 



( A\^ 



2 i-'^+^i- l_(A;-l);x\ 
then subset $A_1$ is the type-1 optimal subset.

The following corollary gives a sufficient condition for $A_1$ to be the
concurrent optimal subset.

Corollary 4.9. Given conditions (4), (5), (6) and (7), subset $A_1$ is
the concurrent optimal subset.

4.2.1. Restrictiveness of the aforementioned sufficient conditions.
Readers may have noticed that the four conditions in the previous section are
restrictive. One can easily find an example that does not satisfy these
conditions, but which still has the concurrent optimal subset $A_1$.

An example can be established as follows. Suppose that n, m and k are
three positive integers satisfying $n > m > k$ and $n \ge m + k$. Let $a_i$ denote the
$i$th entry of vector $a \in \mathbb{R}^k$ with $|a_1| \ge |a_2| \ge \cdots \ge |a_k|$. Let $I_{m \times m} \in \mathbb{R}^{m \times m}$
be an identity matrix and $\mathrm{diag}(a) \in \mathbb{R}^{k \times k}$ be the diagonal matrix with the $i$th
diagonal entry being equal to $a_i$. Consider

V 0(„_fc_m)xm / J *=1 

where "standardized{M}" refers to the standardization of all of the columns
of the matrix M, the matrices $0_{k \times (m-k)}$ and $0_{(n-k-m) \times m}$ consist entirely
of zeros and is the ith column of The optimal solution consists of
the first k covariates and these covariates have larger correlations with y.
However, there are many choices of m, n, k and the vector a with which
condition (4) is not satisfied. As a special case, consider the following simple
example: $n = 10$, $m = 7$, $k = 3$ and $a = (-1, 1, 0)^T$. It is not hard to verify
that $\mu = 0.1667$, $z_3 = 0.7379$, $z_4 = -0.3162$, the left-hand side of (4) is
$[1 - (k - 1)\mu] z_k^2 = 0.3630$ and the right-hand side of (4) is 0.9117. Hence,
(4) does not hold in this case.

5. Discussion. The question addressed in this paper has a unique aspect. 
We have the following application in mind: supposing a stepwise algorithm 
finds a path of type-1 optimal subsets, then given verifiable (polynomial- 
time) conditions that are derived in this paper, one knows whether a type-0 
optimal subset has been found. As mentioned earlier, our results potentially 
facilitate polynomial-time solutions to seemingly NP-hard problems. 

Our problem is different from that of analyzing statistical properties of 
the estimators. These properties include consistency, rate of convergence, 
asymptotic normality and so on. We found the oracle properties derived in 
[12] very interesting. However, Fan and Li [12] do not address whether their 
estimator — smoothly clipped absolute deviation penalty (SCAD) — can be 
computed in polynomial time. In fact, because of the possible exponential 
number of local optima, it is strongly believed that SCAD cannot be solved 
in polynomial time. Hence, an interesting question will be: when can one 
verify that SCAD is indeed solved by a polynomial-time algorithm? That is, 
we want to derive some sufficient conditions similar to those in the present 
paper. Note that Fan and Peng [14] give a fundamental description of when 
oracle properties (as well as other properties) are achievable, while a recent 
manuscript by Zou [50] proves the oracle property for a method that is 
rooted in the Lasso. 

As pointed out by an anonymous referee, there are two categories of
equivalence conditions for (P0) and (P1): a priori conditions determine in advance
when solving (P1) will identify a solution to (P0), while a posteriori
conditions take a given subset of covariates (produced in any manner) and
determine whether it is an optimal subset for (P0). The main results in this
paper belong to the latter class. Given the target application described at 
the beginning of this section, it is not surprising that the latter is more in- 
teresting to us than the former. Moreover, a subset satisfying the former will 
most likely satisfy the latter, which implies that the a posteriori conditions 
are more powerful in the target application because they can identify more 
cases of equivalence. 

Subset selection has applications in feature selection. There are two ma- 
jor approaches in feature selection: filter and wrapper; see [27, 28, 32] for 
details. Our formulations are closely related to wrappers. A recent survey
paper by Fan and Li [13] gives an excellent overview of the statistical
challenges associated with high-dimensional data, including feature selection and
feature extraction. Besides many contemporary applications, as summarized
in [13], other applications are foreseeable. For example, subset selection is a
critical problem in supersaturated design. A citation search of Wu [49] will
provide most of the existing literature. A numerically efficient condition on
the optimality of subsets has the potential to identify a good design.

6. Conclusion. Stepwise algorithms can be numerically efficient, that is, 
polynomial-time. Specially designed stepwise algorithms can find type-1 op- 
timal subsets in subset selection. We have derived sufficient conditions to 
test whether these type-1 optimal subsets are also type-0 optimal. Such an 
approach allows polynomial-time algorithms to locate concurrent optimal 
subsets, which, otherwise, generally requires solving an NP-hard optimiza- 
tion problem. 



A.1. Proof of Theorem 2.1. Let

$$f(m) = \min_{x : \|x\|_0 \le m} \|y - \Phi x\|_2^2,$$

where all of the symbols are defined in (P0). It is evident that the point
array $(m, f(m))$, $m = 1, 2, \dots$, forms a nonincreasing curve in the positive
quadrant.

We first establish the existence of an integer $m_0$ such that the value
$f(m_0) + \lambda_0 m_0$ minimizes the objective in (P0). Note that there are a finite
number of $m$'s such that $\lambda_0 m \le f(1) + \lambda_0 \cdot 1$. This inequality gives an upper
bound on the $m$'s that satisfy $f(m) + \lambda_0 m \le f(1) + \lambda_0 \cdot 1$. Among this finite
number of $m$'s, there is at least one $m_0$ that minimizes the value of the
function $f(m) + \lambda_0 m$.

Define $\epsilon = f(m_0)$. In general, we can assume that $\epsilon > 0$ because if $\epsilon = 0$,
then the response $y$ can be superposed by a small (more specifically, no more
than $m_0$) number of columns of the matrix $\Phi$, which is a special case.

Using the idea of the Lagrange multiplier, we can see that solving (P0)
with $\lambda_0$ is equivalent to solving the sparse approximate solution (SAS)
problem in [33], Section 2, with $\epsilon$, which is proved in [33] to be NP-hard. Hence,
in general, solving (P0) is NP-hard.

A.2. Proof of Theorem 3.1. Suppose that $\Omega'$ is the type-0 optimal
subset, with corresponding coefficient vector $x'$. We must have

(8) $\|y - \Phi x'\|_2^2 + \lambda_0 \|x'\|_0 \le \|y - \Phi x\|_2^2 + \lambda_0 \|x\|_0$.

Denoting $\delta = x' - x$, we have $\|\delta\|_0 \le |\Omega| + |\Omega'|$. We will prove that

(9) if $|\Omega'| < |\Omega|$, then $\|\delta\|_\infty \le q_1(|\Omega|)$

and

(10) for any $\Omega'$, $\|\delta\|_\infty \le q(|\Omega|)$.

To see the above, a reformulation of (8) gives

$\|\varepsilon - \Phi\delta\|_2^2 \le \|\varepsilon\|_2^2 + \lambda_0 (|\Omega| - |\Omega'|)$,

which is equivalent to

(11) $\|\Phi\delta\|_2^2 \le 2 \langle \Phi^T\varepsilon, \delta \rangle + \lambda_0 (|\Omega| - |\Omega'|)$,

where $\langle \cdot, \cdot \rangle$ denotes the inner product between two vectors. Define
$\delta' = \sigma^2_{\min,|\Omega|+|\Omega'|} \cdot \delta$. Because $\|\Phi\delta\|_2^2 \ge \sigma^2_{\min,|\Omega|+|\Omega'|} \|\delta\|_2^2$ and (11) hold, we have

$\|\delta'\|_2^2 \le 2 \langle \Phi^T\varepsilon, \delta' \rangle + \lambda_0 (|\Omega| - |\Omega'|) \cdot \sigma^2_{\min,|\Omega|+|\Omega'|}$.

The above is equivalent to

$\|\Phi^T\varepsilon - \delta'\|_2^2 \le \|\Phi^T\varepsilon\|_2^2 + \lambda_0 (|\Omega| - |\Omega'|) \cdot \sigma^2_{\min,|\Omega|+|\Omega'|}$.

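Define $\varepsilon^*$ as the vector that coincides with $\Phi^T\varepsilon$ on the support of $\delta$ and is
zero elsewhere. The above inequality leads to

$\|\varepsilon^* - \delta'\|_2^2 \le \|\varepsilon^*\|_2^2 + \lambda_0 (|\Omega| - |\Omega'|) \cdot \sigma^2_{\min,|\Omega|+|\Omega'|}$.

The above immediately leads to

$\|\delta'\|_\infty \le \|\varepsilon^*\|_\infty + \sqrt{\|\varepsilon^*\|_2^2 + \lambda_0 (|\Omega| - |\Omega'|) \cdot \sigma^2_{\min,|\Omega|+|\Omega'|}}$.

Dividing both sides by $\sigma^2_{\min,|\Omega|+|\Omega'|}$, we have

(12) $\sup_i |\delta_i| \le \dfrac{c(\Phi^T\varepsilon, 1) + \sqrt{c^2(\Phi^T\varepsilon, |\Omega| + |\Omega'|) + \lambda_0 (|\Omega| - |\Omega'|) \cdot \sigma^2_{\min,|\Omega|+|\Omega'|}}}{\sigma^2_{\min,|\Omega|+|\Omega'|}}$.

Recalling the definitions of $q_1(\cdot)$ and $q(\cdot)$, (9) and (10) can be derived directly
from (12).

We are now able to verify item (1) of the theorem. Suppose that there is
a type-0 optimal subset $\Omega'$ satisfying $|\Omega'| < |\Omega|$. For every $i \in \Omega$, we have

$|x'_i| \ge |x_i| - |x_i - x'_i| \ge |x_i| - q_1(|\Omega|) > 0$.

The second inequality is based on (9) and the last inequality follows from the
condition in item (1). The above implies $\Omega \subseteq \Omega'$, which contradicts $|\Omega'| < |\Omega|$.
We have proven item (1).

The proof of item (2) is very similar to the proof of (1). We omit the
obvious details.
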
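A.3. Proof of Theorem 3.2. The beginning of the proof is the same as
that of the proof of the previous theorem. It begins to deviate at stage (11).
For the reader's convenience, we restate inequality (11):

(13) $\|\Phi\delta\|_2^2 \le 2 \langle \Phi^T\varepsilon, \delta \rangle + \lambda_0 (|\Omega| - |\Omega'|)$.

Readers are referred to the previous proof for the meanings of the notation.
First, we have

(14) $\langle \Phi^T\varepsilon, \delta \rangle \le \sum_{i=1}^{n} |b_{(i)}| \cdot |\delta_{(i)}|$,

where $|\delta_{(1)}| \ge |\delta_{(2)}| \ge \cdots \ge |\delta_{(n)}|$ is the ordered list of the magnitudes of the
entries in the vector $\delta$. Similarly, $|b_{(1)}| \ge |b_{(2)}| \ge \cdots \ge |b_{(n)}|$ is the ordered
list of the magnitudes of the entries in the vector $\Phi^T\varepsilon$. We denote $\Phi^T\varepsilon$ by
$b$. The following manipulations are needed:

R.H.S. of (14) $= \sum_{i=1}^{|\Omega|} |b_{(i)}| \cdot |\delta_{(i)}| + \sum_{i=|\Omega|+1}^{n} |b_{(i)}| \cdot |\delta_{(i)}|$

$\le \sum_{i=1}^{|\Omega|} |b_{(i)}| \cdot |\delta_{(i)}| + |b_{(|\Omega|+1)}| \cdot \sum_{i=|\Omega|+1}^{n} |\delta_{(i)}|$

(15) $\le \sum_{i=1}^{|\Omega|} |b_{(i)}| \cdot |\delta_{(i)}| + |b_{(|\Omega|+1)}| \cdot \frac{|\Omega'|}{|\Omega|} \cdot \sum_{i=1}^{|\Omega|} |\delta_{(i)}|$

$\le \left(1 + \frac{|\Omega'|}{|\Omega|}\right) \sum_{i=1}^{|\Omega|} |b_{(i)}| \cdot |\delta_{(i)}|$

$= \left(1 + \frac{|\Omega'|}{|\Omega|}\right) \langle b^*, \delta_{|\Omega|} \rangle$,

where the vector $\delta_{|\Omega|}$ takes the absolute values of $\delta$ only at the positions
where the vector $\delta$ has the $|\Omega|$ largest magnitudes and zeros elsewhere, that
is,

$(\delta_{|\Omega|})_i = |\delta_i|$ if $|\delta_i| \ge |\delta_{(|\Omega|)}|$, and $(\delta_{|\Omega|})_i = 0$ otherwise.

For the vector $b^*$,

$b^*_i = |b_{(j)}|$ if $|\delta_i| = |\delta_{(j)}|$ and $|\delta_i| \ge |\delta_{(|\Omega|)}|$, and $b^*_i = 0$ otherwise.

Combining (14) and (15), we have

(16) $\langle \Phi^T\varepsilon, \delta \rangle \le \left(1 + \frac{|\Omega'|}{|\Omega|}\right) \langle b^*, \delta_{|\Omega|} \rangle$.
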
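Meanwhile, for any $\Omega^*$ we have

(17) $\|\Phi\delta\|_2^2 \ge \|\Phi_{\Omega^*} \Phi_{\Omega^*}^{+} \Phi\delta\|_2^2 \ge \sigma^2_{\min,|\Omega^*|} \cdot \|\Phi_{\Omega^*}^{+} \Phi\delta\|_2^2 = \sigma^2_{\min,|\Omega^*|} \cdot \|\delta_{\Omega^*} + \Phi_{\Omega^*}^{+} \Phi_{\Omega^{*c}} \delta_{\Omega^{*c}}\|_2^2$,

where the set $\Omega^{*c}$ is the complement of the set $\Omega^*$, and the matrices $\Phi_{\Omega^*}$
and $\Phi_{\Omega^{*c}}$ are submatrices of the matrix $\Phi$ formed by taking columns whose
indices are in $\Omega^*$ and $\Omega^{*c}$, respectively. As mentioned earlier, the matrix
$\Phi_{\Omega^*}^{+}$ is a pseudo-inverse of $\Phi_{\Omega^*}$. The vector $\delta_{\Omega^*}$ (resp., $\delta_{\Omega^{*c}}$) takes only nonzero
values when the index is in the set $\Omega^*$ (resp., $\Omega^{*c}$). In the above steps, the
first inequality holds because the matrix $\Phi_{\Omega^*} \Phi_{\Omega^*}^{+}$ is a projection matrix. The
second inequality is based on the definition of $\sigma^2_{\min,|\Omega^*|}$. The last step is a
simple reformulation.

Note that in (17), $\Omega^*$ can be any subset of the indices. In the following,
without loss of generality, we assume that the set $\Omega^*$ corresponds to the
largest $|\Omega|$ magnitudes in the vector $\delta$, that is, $|\Omega^*| = |\Omega|$ and $\delta_{\Omega^*} = \delta_{|\Omega|}$. We
then have

(18) $\|\delta_{\Omega^*} + \Phi_{\Omega^*}^{+} \Phi_{\Omega^{*c}} \delta_{\Omega^{*c}}\|_2 \ge \|\delta_{\Omega^*}\|_2 - \|\Phi_{\Omega^*}^{+} \Phi_{\Omega^{*c}} \delta_{\Omega^{*c}}\|_2$

$\ge \|\delta_{\Omega^*}\|_2 - \sum_{k \in \Omega^{*c}} |\delta_k| \cdot \sup_{k \in \Omega^{*c}} \|\Phi_{\Omega^*}^{+} \phi_k\|_2$

$\ge \|\delta_{|\Omega|}\|_2 - \sum_{k=|\Omega|+1}^{n} |\delta_{(k)}| \cdot \sup_{k \in \Omega^{*c}} \|\Phi_{\Omega^*}^{+} \phi_k\|_2$

$\ge \|\delta_{|\Omega|}\|_2 - \frac{|\Omega'|}{|\Omega|} \|\delta_{|\Omega|}\|_1 \cdot \sup_{k \in \Omega^{*c}} \|\Phi_{\Omega^*}^{+} \phi_k\|_2$

$\ge \|\delta_{|\Omega|}\|_2 - \frac{|\Omega'|}{\sqrt{|\Omega|}} \|\delta_{|\Omega|}\|_2 \cdot \sup_{k \in \Omega^{*c}} \|\Phi_{\Omega^*}^{+} \phi_k\|_2$

$\ge \lambda(|\Omega|; |\Omega'|) \cdot \|\delta_{|\Omega|}\|_2$.

In the above, the first and second steps are common manipulations. The
third inequality takes $\Omega^*$ to be the subset of indices where $\delta_{|\Omega|}$ has nonzero
entries. The fourth inequality is based on $\frac{|\Omega'|}{|\Omega|} \|\delta_{|\Omega|}\|_1 \ge \sum_{k=|\Omega|+1}^{n} |\delta_{(k)}|$.
The fifth inequality is based on $\|\delta_{|\Omega|}\|_1 \le \sqrt{|\Omega|} \cdot \|\delta_{|\Omega|}\|_2$. The last step recalls
the definition of $\lambda(\cdot; \cdot)$. Combining (17) and (18), we have

(19) $\|\Phi\delta\|_2^2 \ge \sigma^2_{\min,|\Omega|} \cdot \lambda^2(|\Omega|; |\Omega'|) \cdot \|\delta_{|\Omega|}\|_2^2$.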

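We now combine the above results and then maneuver back to the
argument in the proof of Theorem 3.1. Combining (13), (16) and (19), we
have

$\sigma^2_{\min,|\Omega|} \cdot \lambda^2(|\Omega|; |\Omega'|) \cdot \|\delta_{|\Omega|}\|_2^2 \le 2 \left(1 + \frac{|\Omega'|}{|\Omega|}\right) \langle b^*, \delta_{|\Omega|} \rangle + \lambda_0 (|\Omega| - |\Omega'|)$.

Let

$\delta' = \frac{|\Omega|}{|\Omega| + |\Omega'|} \cdot \sigma^2_{\min,|\Omega|} \cdot \lambda^2(|\Omega|; |\Omega'|) \cdot \delta_{|\Omega|}$.

We have

$\|\delta'\|_2^2 \le 2 \langle b^*, \delta' \rangle + \lambda_0 \cdot \frac{|\Omega|^2 (|\Omega| - |\Omega'|)}{(|\Omega| + |\Omega'|)^2} \cdot \sigma^2_{\min,|\Omega|} \cdot \lambda^2(|\Omega|; |\Omega'|)$.

The above is equivalent to

$\|b^* - \delta'\|_2^2 \le \|b^*\|_2^2 + \lambda_0 \cdot \frac{|\Omega|^2 (|\Omega| - |\Omega'|)}{(|\Omega| + |\Omega'|)^2} \cdot \sigma^2_{\min,|\Omega|} \cdot \lambda^2(|\Omega|; |\Omega'|)$,

which leads to

$\|\delta'\|_\infty \le \|b^*\|_\infty + \sqrt{c^2(b^*, |\Omega|) + \lambda_0 \cdot \frac{|\Omega|^2 (|\Omega| - |\Omega'|)}{(|\Omega| + |\Omega'|)^2} \cdot \sigma^2_{\min,|\Omega|} \cdot \lambda^2(|\Omega|; |\Omega'|)}$.

Recalling the definitions of $\delta'$ and $b^*$, we have

(20) $\|\delta\|_\infty \le \frac{|\Omega| + |\Omega'|}{|\Omega| \cdot \sigma^2_{\min,|\Omega|} \cdot \lambda^2(|\Omega|; |\Omega'|)} \left( c(\Phi^T\varepsilon, 1) + \sqrt{c^2(\Phi^T\varepsilon, |\Omega|) + \lambda_0 \cdot \frac{|\Omega|^2 (|\Omega| - |\Omega'|)}{(|\Omega| + |\Omega'|)^2} \cdot \sigma^2_{\min,|\Omega|} \cdot \lambda^2(|\Omega|; |\Omega'|)} \right) \le q'(|\Omega|; M)$.

The above is equivalent to $\|x - x'\|_\infty \le q'(|\Omega|; M)$. Using the same argument
as in the last proof, we can argue that $\Omega \subseteq \Omega'$. Supposing $x_i \ne 0$, we have

$|x'_i| \ge |x_i| - |x_i - x'_i| \ge |x_i| - q'(|\Omega|, M) > 0$,

which implies that $\Omega \subseteq \Omega'$.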